Next up on my to-learn list is dplyr. I use group_by and summarize fairly regularly, but there is other functionality that I want to learn to take advantage of:

- mutate
- select
- filter
- arrange
- pull (instead of my favorite, $)
- merge

In anticipation of The Bachelorette starting at the end of May, I’m going to wrangle some data on previous contestants. Yes, watching The Bachelorette/The Bachelor is my guilty pleasure. But this feels fitting because the dollar sign is my R guilty pleasure. I think this will be the hardest habit to break, but here I try to gain some muscle memory for mutate and pull.
I feel the disapproval already, but read this super-scientific article before judging me.
require(stringr)
require(dplyr)
require(ggmap)
require(fuzzyjoin)
require(maps)
require(tigris)
require(sp)
require(acs)
Luckily, both 538 and Kaggle have some data on this, so I don’t have to do any web scraping.
#https://github.com/fivethirtyeight/data/tree/master/bachelorette
#https://www.kaggle.com/brianbgonz/the-bachelor-contestants/data
setwd("~/Desktop/data-538/bachelorette")
contestants<-read.csv("contestants.csv",stringsAsFactors=F) ## just female contestants
bachelors<-read.csv("bachelors.csv",stringsAsFactors=F) ## bachelors
weekByWeek<-read.csv("bachelorette.csv",stringsAsFactors=F) ## both
These datasets contain different information, so the goal of this exercise is to wrangle them together and do something simple with the result. Plenty of others have done fancier stuff with this kind of data. I’m just trying to learn some new tidy verbs.
head(contestants)
## Name Age Occupation Hometown
## 1 Amanda Marsh 23 Event Planner Chanute, Kansas
## 2 Trista Rehn 29 Miami Heat Dancer Miami, Florida
## 3 Shannon Oliver 24 Financial Management Consultant Dallas, Texas
## 4 Kim 24 Nanny Tempe, Arizona
## 5 Cathy Grimes 22 Graduate Student Terra Haute, Indiana
## 6 Christina 28 Attorney Bonita, California
## Height ElimWeek Season
## 1 NA NA 1
## 2 NA 6 1
## 3 NA 5 1
## 4 NA 4 1
## 5 NA 3 1
## 6 NA 3 1
head(bachelors)
## Name Age Hometown Height Season
## 1 Alex Michel 32 Charlottesville, Virginia NA 1
## 2 Aaron Buerge 28 Butler, Missouri NA 2
## 3 Jesse Palmer 34 Toronto, Ontario NA 5
## 4 Lorenzo Borghese 34 Milan, Italy NA 9
## 5 Andy Baldwin 30 Lancaster, Pennsylvania NA 10
## 6 Brad Womack 35 Austin, Texas NA 11
head(weekByWeek,2)
## SHOW SEASON CONTESTANT ELIMINATION.1 ELIMINATION.2 ELIMINATION.3
## 1 SHOW SEASON ID 1 2 3
## 2 Bachelorette 13 13_BRYAN_A R1
## ELIMINATION.4 ELIMINATION.5 ELIMINATION.6 ELIMINATION.7 ELIMINATION.8
## 1 4 5 6 7 8
## 2 R R R
## ELIMINATION.9 ELIMINATION.10 DATES.1 DATES.2 DATES.3 DATES.4 DATES.5
## 1 9 10 1 2 3 4 5
## 2 W D6 D13 D1
## DATES.6 DATES.7 DATES.8 DATES.9 DATES.10
## 1 6 7 8 9 10
## 2 D7 D1 D1 D1 D1
Get rid of the header rows embedded in the data.
weekByWeek=weekByWeek[-which(weekByWeek$SEASON=="SEASON"),]
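Incidentally, since filter is on the to-learn list, the same header drop can be written with it. A sketch that should give the identical result as the base-R subsetting above:

```r
## dplyr equivalent of dropping the embedded header rows:
weekByWeek <- weekByWeek %>% filter(SEASON != "SEASON")
```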
Use select and mutate to add the number of dates.
Usually I would just use the dollar sign to add new columns (and I would manually specify the dates columns).
dates=select(weekByWeek,starts_with("DATES"))
weekByWeek=weekByWeek %>% mutate(numOneOnOneDates=apply(dates,1,function(x){length(which(x=="D1"))}))
weekByWeek=weekByWeek %>% mutate(numDates=apply(dates,1,function(x){length(which(x!=""))}))
weekByWeek=weekByWeek %>% mutate(numGroupDates=numDates-numOneOnOneDates) ## group dates = total dates minus one-on-ones
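For what it’s worth, the three apply calls can also be done vectorized with rowSums on the whole block of DATES columns at once. A sketch; na.rm = TRUE mimics the NA-dropping behavior of length(which(...)):

```r
## Same counts, computed without apply:
weekByWeek <- weekByWeek %>%
  mutate(numOneOnOneDates = rowSums(dates == "D1", na.rm = TRUE),
         numDates         = rowSums(dates != "",   na.rm = TRUE),
         numGroupDates    = numDates - numOneOnOneDates)
```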
Get the names ready to join using mutate and pull. To standardize each data set, I want first names and last initials in all capitals.
Note: The string processing here is rudimentary. stringr will have its own blog post.
weekByWeek= weekByWeek %>% mutate(firstName=unlist(lapply(pull(weekByWeek, CONTESTANT),
function(x){unlist(str_split(x, "_"))[2]})))
weekByWeek= weekByWeek %>% mutate(lastInitial=unlist(lapply(pull(weekByWeek, CONTESTANT),
function(x){unlist(str_split(x, "_"))[3]})))
weekByWeek=weekByWeek %>% mutate(lastInitial=ifelse(is.na(lastInitial),"",lastInitial)) ## ifelse is vectorized, so no lapply needed
weekByWeek=weekByWeek %>% mutate(nameNice=paste(firstName,lastInitial,sep=" "))
contestants=contestants %>% mutate(firstName=unlist(lapply(pull(contestants,Name),function(x){unlist(str_split(x," "))[1]})))
contestants=contestants %>% mutate(lastName=unlist(lapply(pull(contestants,Name),function(x){unlist(str_split(x," "))[2]})))
contestants=contestants %>% mutate(lastInitial=unlist(lapply(pull(contestants,lastName),function(x){unlist(str_split(x,""))[1]})))
contestants=contestants %>% mutate(lastInitial=ifelse(is.na(lastInitial),"",lastInitial)) ## ifelse is vectorized, so no lapply needed
contestants=contestants %>% mutate(nameNice=toupper(paste(firstName,lastInitial,sep=" ")))
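As an aside, stringr’s word() (the package is already loaded above) can replace the split-and-index pattern. A sketch of the same derivation:

```r
## First word of Name, plus the first letter of the second word (NA-safe):
contestants <- contestants %>%
  mutate(firstName   = word(Name, 1),
         lastInitial = str_sub(word(Name, 2), 1, 1),
         lastInitial = ifelse(is.na(lastInitial), "", lastInitial),
         nameNice    = toupper(paste(firstName, lastInitial)))
```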
Deal with some weird entries (nicknames have parentheses that mess things up).
contestants[which(grepl("\\(",contestants$Name)),]
## Name Age Occupation
## 100 Britt (Bowe) Newton 28 Beer chemist
## 212 Elizabeth Kitt (NE) 29 Nanny
## 218 Alexa (Lex) McAllister 25 Entrepreneur
## 221 Elizabeth Kreft (D.C.) 29 Captain, Air National Guard
## Hometown Height ElimWeek Season firstName lastName
## 100 Columbus, OH NA 1 9 Britt (Bowe)
## 212 Imperial, Nebraska NA 3 14 Elizabeth Kitt
## 218 Galloway, Ohio NA 1 14 Alexa (Lex)
## 221 Union, Kentucky NA 1 14 Elizabeth Kreft
## lastInitial nameNice
## 100 ( BRITT (
## 212 K ELIZABETH K
## 218 ( ALEXA (
## 221 K ELIZABETH K
contestants$nameNice[which(grepl("Bowe",contestants$Name))]="Britt N"
contestants$nameNice[which(grepl("McAllister",contestants$Name))]="Alexa A"
Before we start merging, we should get a sense of the best we can do. Kaggle warns us that the data is missing some seasons.
dim(contestants)
## [1] 423 11
dim(weekByWeek)
## [1] 887 29
I usually use merge, but dplyr favors inner_join, left_join, right_join, etc., so let’s get used to that syntax.
weekByWeek=weekByWeek %>% mutate(SEASON=as.numeric(SEASON))
tryMerge=inner_join(contestants,weekByWeek,by=c("nameNice"="nameNice","Season"="SEASON"))
dim(tryMerge)
## [1] 313 38
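anti_join is handy here for seeing exactly which rows failed to match. A sketch (keeping with the no-dollar-sign theme by using pull):

```r
## Rows of contestants with no match in weekByWeek:
unmatched <- anti_join(contestants, weekByWeek,
                       by = c("nameNice" = "nameNice", "Season" = "SEASON"))
head(pull(unmatched, Name))
```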
Because some contestants don’t have last names listed in the Kaggle data, we are losing a lot of rows. Within a season we should be able to do a rough join instead to recover some of these. However, there are seasons where multiple contestants have the same first name. We can see that duplicates do occur with this fuzzy join.
mergedData=weekByWeek %>% regex_inner_join(contestants, by = c(nameNice = "nameNice",SEASON="Season"))
dim(mergedData)
## [1] 433 40
length(which(duplicated(mergedData$CONTESTANT)))
## [1] 41
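If we wanted one row per contestant, distinct can drop those duplicates. A sketch; note that which duplicate survives depends on row order, so this is a blunt fix:

```r
## Keep the first row for each CONTESTANT id:
deduped <- mergedData %>% distinct(CONTESTANT, .keep_all = TRUE)
```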
Since one of the datasets only has the female contestants, our merged file will only have the women in it. Where are the female contestants from?
coordinates=geocode(pull(mergedData,Hometown),output="latlon")
write.csv(coordinates,"bachelorCoords.csv",row.names=F)
Using my ggplot skills from previous posts…
all_states <- map_data("state")
p <- ggplot()+ geom_polygon( data=all_states, aes(x=long, y=lat, group = group),colour="black", fill="white" )
p <- p+ geom_point(data=coordinates,aes(x=lon,y=lat))+xlim(-125,-60)+ylim(25,50)+theme_void()
p
Let’s find out how many per state (and take advantage of arrange).
More building off of previous ggplot skills…
pts = SpatialPoints(coordinates[complete.cases(coordinates),])
#https://journal.r-project.org/archive/2016/RJ-2016-043/RJ-2016-043.pdf
## There is probably an easier way to do this.
us_states <- unique(fips_codes$state)[1:51]
continental_states <- us_states[!us_states %in% c("AK", "HI")]
us_pumas <- rbind_tigris(
  lapply(continental_states, function(x) {
    pumas(state = x, cb = TRUE)
  })
)
proj4string(pts)=proj4string(us_pumas) ## this is needed for over
withinContinental=over(pts,us_pumas)
byState=group_by(withinContinental, STATEFP10) %>% summarise(count=n())
byState=byState %>% mutate(STATEFP10=as.numeric(STATEFP10))
byState=inner_join(byState,fips.state,by=c("STATEFP10"="STATE"))%>% arrange(desc(count))
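The group_by/summarise/arrange trio has a shorthand, count, which should give the same tallies (a sketch; note count names its tally column n rather than count):

```r
## count() = group_by() + summarise(n = n()), with sort = TRUE for desc order:
byState2 <- withinContinental %>% count(STATEFP10, sort = TRUE)
```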
I’m curious whether the males participating on The Bachelorette are from the same types of places. It seems to me like small-town Southern girls are more represented than small-town Southern boys, but I want to test this theory. Let’s filter for the South to at least start getting a sense.
byState
## # A tibble: 39 x 5
## STATEFP10 count STUSAB STATE_NAME STATENS
## <dbl> <int> <chr> <chr> <int>
## 1 6. 46 CA California 1779778
## 2 48. 37 TX Texas 1779801
## 3 12. 27 FL Florida 294478
## 4 36. 21 NY New York 1779796
## 5 17. 19 IL Illinois 1779784
## 6 26. 13 MI Michigan 1779789
## 7 42. 12 PA Pennsylvania 1779798
## 8 4. 11 AZ Arizona 1779777
## 9 47. 10 TN Tennessee 1325873
## 10 49. 10 UT Utah 1455989
## # ... with 29 more rows
region=cbind.data.frame(state.abb, as.character(state.region))
names(region)=c("abb","reg")
south=inner_join(byState,region,by=c("STUSAB"="abb"))%>%filter(reg=="South")
south
## # A tibble: 15 x 6
## STATEFP10 count STUSAB STATE_NAME STATENS reg
## <dbl> <int> <chr> <chr> <int> <fct>
## 1 48. 37 TX Texas 1779801 South
## 2 12. 27 FL Florida 294478 South
## 3 47. 10 TN Tennessee 1325873 South
## 4 37. 8 NC North Carolina 1027616 South
## 5 21. 6 KY Kentucky 1779786 South
## 6 40. 6 OK Oklahoma 1102857 South
## 7 13. 5 GA Georgia 1705317 South
## 8 45. 4 SC South Carolina 1779799 South
## 9 51. 4 VA Virginia 1779803 South
## 10 5. 3 AR Arkansas 68085 South
## 11 24. 3 MD Maryland 1714934 South
## 12 1. 2 AL Alabama 1779775 South
## 13 22. 2 LA Louisiana 1629543 South
## 14 28. 1 MS Mississippi 1779790 South
## 15 54. 1 WV West Virginia 1779805 South
sum(south$count)/sum(byState$count)
## [1] 0.3541667
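True to the point of this post, the same ratio works without any dollar signs by using pull:

```r
## The South’s share of contestants, dollar-sign-free:
sum(pull(south, count)) / sum(pull(byState, count))
```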
That covers the dplyr verbs, and I’m starting to get the hang of it. I think replacing the dollar sign with mutate will come more easily, but I fear that pull will always be a stretch for me. Fingers crossed I can kick the dollar sign habit.